{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LoC Data Package Tutorial: City and Telephone Directories\n", "\n", "This notebook demonstrates basic Python usage for interacting with [data packages from the Library of Congress](https://data.labs.loc.gov/packages/) via the [Directory Holdings Data Package](https://data.labs.loc.gov/directories/), which is derived from the Library's [United States: City and Telephone Directories](https://guides.loc.gov/united-states-city-telephone-directories/introduction) and [Directories By Address: Inventories of Library Collections Library Guides](https://guides.loc.gov/address-directories/criss-cross). We will:\n", "\n", "1. [Read and query metadata from a data package](#Query-the-metadata-in-a-data-package)\n", "2. [Visualize the data](#Visualize-the-data)\n", "\n", "## Prerequisites\n", "\n", "To run this notebook, please follow the instructions listed in [this directory's README](https://github.com/LibraryOfCongress/data-exploration/blob/master/Data%20Packages/README.md)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query the metadata in a data package\n", "\n", "First, we will download a data package's metadata file, print a summary of the items' location values, and then filter by a particular location.\n", "\n", "All data packages have a metadata file in both .json and .csv formats. 
Let's load the `metadata.json` file for the data package's City Directories:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded metadata file with 56,612 entries.\n" ] } ], "source": [ "import io\n", "\n", "import pandas as pd # for reading, manipulating, and displaying data\n", "import requests\n", "\n", "DATA_URL = 'https://data.labs.loc.gov/directories/'\n", "\n", "metadata_url = f'{DATA_URL}by-directory-type/City Directories/metadata.json'\n", "# Also try: by-directory-type/Criss-cross Directories/metadata.json\n", "# Or: by-directory-type/Telephone Directories/metadata.json\n", "response = requests.get(metadata_url, timeout=60)\n", "response.raise_for_status() # stop here if the download failed\n", "data = response.json()\n", "print(f'Loaded metadata file with {len(data):,} entries.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's convert the data to a pandas DataFrame and print the available properties:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State_region, Locality, Date, Source_collection, Location_text, Date_text, Genre, Original_format, Language, Notes, Repository, Type_of_resource, Digitized, Url, Shelf_id, Directory_type, Location\n" ] } ], "source": [ "df = pd.DataFrame(data)\n", "print(', '.join(df.columns.to_list()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, print the 10 most frequent locations (the `State_region` values) in this dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | State_region |\n",
"|---|---|\n",
"| Massachusetts | 5775 |\n",
"| New York | 4334 |\n",
"| Pennsylvania | 3364 |\n",
"| Ohio | 2853 |\n",
"| New Jersey | 2763 |\n",
"| California | 2567 |\n",
"| Michigan | 2514 |\n",
"| Illinois | 2416 |\n",
"| Connecticut | 2188 |\n",
"| Indiana | 1990 |\n", "